{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Motif and Transcription Factor enrichment after Mowgli integration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook sums up guidelines to: \n", "\n", "* extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.\n", "* perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our [manuscript](https://www.nature.com/articles/s41467-023-43019-2)).\n", "* perform a TF enrichment analysis (using genes) using a collection of TF->Targets.\n", "\n", "**NOTE #1:** This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language. \n", "\n", "**NOTE #2:** The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species. \n", "\n", "**NOTE #3:** It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it's very case-specific " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract top features of a modality from Mowgli" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assume that the results of Mowgli integration are in a `mdata` object that store genes in the `rna` and peaks in the `atac` slot." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Python\n", "# method to extract the top features\n", "def top_mowgli(dim, n, H_mowgli):\n", " \"\"\"\n", " Get the top n peaks for a given dimension.\n", " \"\"\"\n", " H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)\n", " return H_scaled[:, dim].argsort()[::-1][:n]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the top peaks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Python\n", "# Peaks\n", "\n", "# we set the number of peaks to look at\n", "n_peaks = 100\n", "\n", "# Get the genes or peak dictionaries\n", "H_mowgli_atac = mdata[\"atac\"].uns[\"H_OT\"]\n", "\n", "# actual features extraction\n", "mdata[\"atac\"].var_names = mdata[\"atac\"].var_names.str.replace(\"atac:\", \"\")\n", "top_in_mowgli = mdata[\"atac\"].var.copy()\n", "\n", "# Fill the Mowgli top peaks.\n", "for dim in range(H_mowgli_atac.shape[1]):\n", " col_name = f\"top_in_dim_{dim}\"\n", " idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]\n", " top_in_mowgli[col_name] = False\n", " top_in_mowgli.loc[idx, col_name] = True\n", "\n", "# Save Mowgli's top peaks.\n", "top_in_mowgli.to_csv(\"top_peaks_in_mowgli.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract the top genes (for other expression-space based enrchments)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Python\n", "# we set the number of peaks to look at\n", "n_genes = 100\n", "\n", "H_mowgli_rna = mdata[\"rna\"].uns[\"H_OT\"]\n", "\n", "# select the top genes to probe using only the highly variable genes (our universe)\n", "top_in_mowgli = (\n", " mdata[\"rna\"].var.loc[mdata[\"rna\"].var[\"highly_variable\"] == True, :].copy()\n", ") # the var coordinates\n", "\n", "for dim in range(H_mowgli_rna.shape[1]):\n", " print(dim)\n", " # name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion\n", " col_name = f\"top_in_dim_{dim}\"\n", " idx = top_in_mowgli.index[\n", " top_mowgli(dim, n_genes, H_mowgli_rna)\n", " ] # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds\n", " top_in_mowgli[col_name] = False # set all value for that dimesions to False\n", " top_in_mowgli.loc[idx, col_name] = True # set to True only the peaks that are\n", "\n", "# Save Mowgli's top genes.\n", "top_in_mowgli.to_csv(\n", " os.path.join(top_feats_dir, f\"top_genes_in mowgli.csv\"),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motif enrichment \n", "\n", "This code was used in the [original publication](https://doi.org/10.1038/s41467-023-43019-2) to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our [Mowgli reproducibility](https://github.com/cantinilab/mowgli_reproducibility) repository." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Imports.\n", "library(GenomicRanges)\n", "library(motifmatchr)\n", "library(chromVAR)\n", "library(TFBSTools)\n", "library(JASPAR2022)\n", "library(Signac)\n", "library(BSgenome.Hsapiens.UCSC.hg38)\n", "library(chromVARmotifs)\n", "library(MuData)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Read atac file.\n", "in_atac <- \"top_peaks_in_mowgli.csv\" # nolint\n", "peaks_csv <- read.csv(in_atac, row.names = 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Optional: Remove non-canonical chromosomes.\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000194.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000205.2\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"GL000219.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270721.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270726.1\",]\n", "peaks_csv < -peaks_csv[peaks_csv[\"Chromosome\"] != \"KI270713.1\",]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Convert the peaks to GRanges.\n", "chromosomes <- peaks_csv[\"Chromosome\"][, 1]\n", "ranges <- IRanges::IRanges(\n", " start = peaks_csv[\"Start\"][, 1],\n", " end = peaks_csv[\"End\"][, 1]\n", ")\n", "peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Get JASPAR motifs.\n", "opts <- list()\n", "opts[\"species\"] <- \"Homo sapiens\"\n", "opts[\"collection\"] <- \"CORE\"\n", "motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)\n", "motifs_pwm <- TFBSTools::toPWM(motifs)\n", "\n", "# Get cisBP motifs.\n", "data(\"human_pwms_v2\")\n", "\n", "# Fuse JASPAR and cisBP motifs.\n", "for (name in names(motifs_pwm)) {\n", " human_pwms_v2[name] <- motifs_pwm[name]\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Create a 'dummy' Signac object from the peaks.\n", "# Actually giving peaks_csv is nonsense.\n", "# But we only care about the rownames so it's fine.\n", "assay <- Signac::CreateChromatinAssay(\n", " peaks_csv,\n", " ranges = peaks,\n", " sep = c(\":\", \"-\")\n", ")\n", "\n", "# Create statistics about peaks.\n", "assay <- Signac::RegionStats(\n", " object = assay,\n", " genome = BSgenome.Hsapiens.UCSC.hg38\n", ")\n", "\n", "# Add the downloaded motif PWM annotation.\n", "assay <- Signac::AddMotifs(\n", " object = assay,\n", " genome = BSgenome.Hsapiens.UCSC.hg38,\n", " pfm = human_pwms_v2\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "# Define where to save the motif enrichment outputs.\n", "out_motif <- \"motifs_\"\n", "\n", "# Get all top peaks.\n", "background <- c()\n", "for (dim in 0:49) {\n", "\n", " # Get the top peaks for that dimension.\n", " features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n", "\n", " background <- c(background, features)\n", "}\n", "\n", "# Iterate over Mowgli's dimensions.\n", "for (dim in 0:49) {\n", "\n", " # Get the top peaks for that dimension.\n", " features <- rownames(assay)[peaks_csv[paste0(\"top_in_dim_\", dim)] == \"True\"]\n", "\n", " # Do motif enrichment analysis.\n", " enriched_motifs <- Signac::FindMotifs(\n", " object = assay,\n", " features = features,\n", " background = background\n", " )\n", "\n", " # Save the enrichment.\n", " write.csv(enriched_motifs, paste0(out_motif, dim, \".csv\"))\n", "}" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "r" } }, "source": [ "## TF Enrichment using top genes in mowgli dimensions" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "r" } }, "source": [ "We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.\n", "\n", "In this case example, we made use of the [Regulatory Circuits](https://doi.org/10.1038/nmeth.3799) database ([link](http://www2.unil.ch/cbg/regulatorycircuits/FANTOM5_individual_networks.tar)), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# Reload libraries\n", "\n", "library(stats)\n", "library(dplyr)\n", "library(stringr)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R \n", "\n", "# reload GRN and make a TF-> Target list\n", "\n", "# network of epithelial cells\n", "grn.path <- \"Regulatory_circuits_mammary_epithelial_cell.txt\"\n", "grn <- read.table(grn.path, sep=\"\\t\", header = F)\n", "colnames(grn) <- c(\"TF\", \"Target\", \"score\")\n", "\n", "# make a TF -> Target list\n", "tf_list <- split(grn$Target, grn$TF)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "\n", "# reload top features for RNA and reparse it\n", "top_feats.path <- \"top_genes_in_mowgli.csv\"\n", "top_feats <- read.table(top_feats.path, sep = \",\", header = T)\n", "# set row names to index\n", "row.names(top_feats) <- top_feats$hgnc_symbol\n", "\n", "cols_to_keep <- c(\"highly_variable\", grep(\"top_in_dim\", names(top_feats), value = TRUE))\n", "\n", "top_feats.filtered <- top_feats %>%\n", " select(all_of(cols_to_keep))\n", "\n", "top_feats.filtered <- top_feats.filtered %>%\n", " mutate(\n", " `highly_variable` = as.logical(ifelse(`highly_variable` == \"True\", TRUE, FALSE)),\n", " across(starts_with(\"top_in_dim\"), ~ as.logical(ifelse(. == \"True\", TRUE, FALSE)))\n", " )\n", "\n", "# define the universe -> in this case, a list of highly variable genes in the RNA slot\n", "universe <- readLines(\"highly_variable_genes.txt\")\n", "\n", "universe.len <- length(universe)\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "r" } }, "source": [ "In brief:\n", "- we loop through each dimension and we select for each dimension the top features\n", "- we loop through each TF (using a groupby) and we identify the top sets of features\n", "- we make a hypergeometric test \n", "- we calculate an enrichment score enrichment\n", "- we correct the pvalue using Benjamini-hochberg correction\n", "- we write the results to a file (only for significant TFs enriched)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "vscode": { "languageId": "r" } }, "outputs": [], "source": [ "# R\n", "\n", "# directory that will store the results\n", "res.dir <- \"tf_enrichment\"\n", "\n", "# colnames of the dataframe storing the results\n", "res.colnames <- c(\"TF\", \"number_of_targets\", \"enrichment_score\", \"p.val\", \"p.adjust\")\n", "\n", "# loop through all the top mowgli dimensions\n", "for (dim in names(top_feats.filtered)[grepl(\"^top_in\", names(top_feats.filtered))]) {\n", " dim_number <- str_extract(dim, \"\\\\d+$\")\n", " print(paste(\"enriching:\", dim_number))\n", " \n", " # select the top features for that dimensiob\n", " top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])\n", " top_genes.len <- length(top_genes) # should always be 100\n", " ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment\n", "\n", " # open the output file\n", " output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n", " res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))\n", " colnames(res.df) <- res.colnames\n", " \n", " # define the list to store files \n", " p_values <- numeric()\n", " tfs <- character()\n", " n_targets <- numeric()\n", " enrichment_scores <- numeric()\n", "\n", " # loop through all the TFs and perform the enrichment\n", " for (tf in names(tf_list)){\n", " targets <- tf_list[[tf]]\n", " \n", " x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets\n", " m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe\n", " n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets\n", " k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)\n", " \n", " p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched\n", " n_targets_expressed <- length(intersect(targets, rownames(universe)))\n", " n_targets_feats <- x/universe.len\n", " ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)\n", " enrichment_score <- 1/ (ratioTargets / ratioDim)\n", " \n", " # Store results for FDR adjustment\n", " p_values <- c(p_values, p_value)\n", " tfs <- c(tfs, tf)\n", " n_targets <- c(n_targets, x)\n", " enrichment_scores <- c(enrichment_scores, enrichment_score)\n", " }\n", "\n", " # Adjust p-values using Benjamini-Hochberg correction\n", " adjusted_p_values <- p.adjust(p_values, method = \"BH\")\n", "\n", " # Combine results into a dataframe\n", " results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)\n", " \n", " # Filter results for significant adjusted p-values\n", " significant_results <- results[results$adjusted_p_value <= 0.05, ]\n", " \n", " # Save significant results to file\n", " output_file_name <- file.path(res.dir, paste0(\"enriched_TFs_dim\", dim_number, \".tsv\"))\n", " write.table(significant_results, file = output_file_name, sep = \"\\t\", row.names = FALSE, col.names = TRUE, quote = FALSE)\n", "}" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.5 ('base')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "nbsphinx": { "execute": "never" } }, "nbformat": 4, "nbformat_minor": 2 }